Eliminating Noisy Information in Web Pages using featured DOM tree
Abstract
Exact information retrieval from the Web is a major challenge that pushes researchers to devise new methodologies for web mining. The number and size of web pages are growing rapidly, and pages are often accompanied by a large amount of noise such as banner advertisements, navigation bars, and copyright notices. Although such items are functionally useful for human viewers and necessary for web site owners, they often hamper automated information gathering and web data mining. The presence of such noisy information degrades the efficiency of feature extraction and, ultimately, classification accuracy. Cleaning web pages before mining is therefore critical for improving mining results. In our work, we focus on identifying and removing local noise in web pages to improve mining performance. We propose a novel and simple idea for detecting and removing local noise using a new tree structure called the featured DOM tree. The proposed algorithm has three stages: features are selected in the first stage, a featured DOM tree is created in the second, and noise is marked and pruned in the third. The experimental results show that our algorithm performs well on various benchmark measures, and that automatic web page classification achieves a higher F score and accuracy as a result.
General Terms: Web content mining, web page classification.
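The three stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not say which features are selected or how noise is scored, so the features used here (`text_len`, `link_ratio`), the `BLOCK_TAGS` set, the `max_link_ratio` threshold, and every function name are assumptions chosen only for illustration. The sketch flags link-heavy blocks such as navigation bars and would not catch other noise such as plain-text copyright notices.

```python
from html.parser import HTMLParser

# Assumption: only these container tags are candidates for pruning.
BLOCK_TAGS = {"div", "ul", "ol", "table", "nav", "footer", "aside", "section"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """One node of the (hypothetical) featured DOM tree."""

    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text          # non-empty only for "#text" nodes
        self.children = []
        self.features = {}        # filled in by annotate()
        self.noisy = False        # set by mark_and_prune()


class TreeBuilder(HTMLParser):
    """Builds a simple DOM-like tree, keeping text in document order."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(Node("#text", self.current, data))


def annotate(node, inside_link=False):
    """Stages 1 and 2: attach the chosen features to every node."""
    inside_link = inside_link or node.tag == "a"
    text_len = len("".join(node.text.split()))  # ignore whitespace
    link_len = text_len if inside_link else 0
    for child in node.children:
        child_text, child_link = annotate(child, inside_link)
        text_len += child_text
        link_len += child_link
    ratio = link_len / text_len if text_len else 0.0
    node.features = {"text_len": text_len, "link_ratio": ratio}
    return text_len, link_len


def mark_and_prune(node, max_link_ratio=0.6):
    """Stage 3: mark link-heavy blocks as noisy and drop them."""
    kept = []
    for child in node.children:
        child.noisy = (
            child.tag in BLOCK_TAGS
            and child.features["link_ratio"] > max_link_ratio
        )
        if not child.noisy:
            mark_and_prune(child, max_link_ratio)
            kept.append(child)
    node.children = kept


def visible_text(node):
    return node.text + "".join(visible_text(c) for c in node.children)


def clean_page(html, max_link_ratio=0.6):
    """Parse, annotate, prune, and return the remaining text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    annotate(builder.root)
    mark_and_prune(builder.root, max_link_ratio)
    return " ".join(visible_text(builder.root).split())


SAMPLE = (
    '<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>'
    '<div id="main"><p>Web mining studies patterns in web data. '
    'See <a href="/x">more</a>.</p></div>'
)

if __name__ == "__main__":
    print(clean_page(SAMPLE))
```

In this sketch the navigation block is pruned because all of its text sits inside links, while the main block survives because only a small share of its text is link text. A real system would likely combine several features and tune the threshold on labeled pages.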
Similar resources
Web Page Performance Enhancement by Removing Noise
Data mining is the process of extracting information from a large data set. Web mining is an important application of data mining that extracts knowledge from Web data, including Web documents, hyperlinks, and usage logs of web sites. A Web page contains many blocks such as content blocks, copyrights, privacy notes and advertisements. These blocks like advertiseme...
Eliminating the Noise from Web Pages using Page Replacement Algorithm
Data mining is the process of mining information from a large data set. It has many categories, such as text mining, web usage mining, and web content mining. Several types of algorithm are used in web mining, e.g. the Visitor method, the DOM tree, and the Least Recently Used algorithm. The Visitor and DOM tree methods are complex and time consuming. Least Recently Used algorithm is less time c...
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
Analyzing new features of infected web content in detection of malicious web pages
Recent improvements in web standards and technologies enable attackers to hide and obfuscate malicious code with new methods and thus escape security filters. In this paper, we study the application of machine learning techniques in detecting malicious web pages. In order to detect malicious web pages, we propose and analyze a novel set of features including HTML, JavaScript (jQuery...
Retrieve Information Using Improved Document Object Model Parser Tree Algorithm
Data mining refers to mining useful information from raw or unstructured data. In web content mining, the data is scattered or unstructured across web pages. Sometimes the user wants to retrieve only a specific kind of data, but unwanted data is retrieved as well. The unnecessary information can be removed with this proposed work. The DOM Parser Tree Algorithm to filter the web pages ...